Abstract 

Visual tracking usually requires an object appearance model that is robust to changing illumination, pose and other factors encountered 
in video. Many recent trackers utilize appearance samples in previous frames to form the bases upon which the object appearance model 
is built. This approach has the following limitations: (a) the bases are data driven, so they can be easily corrupted; and (b) it is difficult to 
robustly update the bases in challenging situations. 

In this paper, we construct an appearance model using the 3D discrete cosine transform (3D-DCT). The 3D-DCT is based on a set 
of cosine basis functions, which are determined by the dimensions of the 3D signal and thus independent of the input video data. In 
addition, the 3D-DCT can generate a compact energy spectrum whose high-frequency coefficients are sparse if the appearance samples are 
similar. By discarding these high-frequency coefficients, we simultaneously obtain a compact 3D-DCT based object representation and a 
signal reconstruction-based similarity measure (reflecting the information loss from signal reconstruction). To efficiently update the object 
representation, we propose an incremental 3D-DCT algorithm, which decomposes the 3D-DCT into successive operations of the 2D discrete 
cosine transform (2D-DCT) and 1D discrete cosine transform (1D-DCT) on the input video data. As a result, the incremental 3D-DCT 
algorithm only needs to compute the 2D-DCT for newly added frames as well as the 1D-DCT along the third dimension, which significantly 
reduces the computational complexity. Based on this incremental 3D-DCT algorithm, we design a discriminative criterion to evaluate the 
likelihood of a test sample belonging to the foreground object. We then embed the discriminative criterion into a particle filtering framework 
for object state inference over time. Experimental results demonstrate the effectiveness and robustness of the proposed tracker. 

Index Terms 

Visual tracking, appearance model, compact representation, discrete cosine transform (DCT), incremental learning, template matching. 

I. Introduction 

Visual tracking of a moving object is a fundamental problem in computer vision. It has a wide range of applications including 
visual surveillance, human behavior analysis, motion event detection, and video retrieval. Despite much effort on this topic, it remains 
a challenging problem because of object appearance variations due to illumination changes, occlusions, pose changes, cluttered and 
moving backgrounds, etc. Thus, a crucial element of visual tracking is to use an effective object appearance model that is robust to 
such challenges. 

Since it is difficult to explicitly model complex appearance changes, a popular approach is to learn a low-dimensional subspace (e.g., 
eigenspace [1], [2]), which accommodates the object's observed appearance variations. This allows the appearance model to reflect 
the time-varying properties of object appearance during tracking (e.g., learning the appearance of the object from multiple observed 
poses). By computing the sample-to-subspace distance (e.g., reconstruction error [1], [2]), the approach can measure the information 
loss that results from projecting a test sample to the low-dimensional subspace. Using the information loss, the approach can evaluate 
the likelihood of a test sample belonging to the foreground object. Since the approach is data driven, it needs to compute the subspace 
basis vectors as well as the corresponding coefficients. 

Inspired by the success of subspace learning for visual tracking, we propose an alternative object representation based on the 3D 
discrete cosine transform (3D-DCT), which has a set of fixed projection bases (i.e., cosine basis functions). Using these fixed projection 
bases, the proposed object representation only needs to compute the corresponding projection coefficients (3D-DCT coefficients). 
Compared with incremental principal component analysis [1], this leads to a much simpler computational process, which is more 
robust to many types of appearance change and enables fast implementation. 

The DCT has a long history in the signal processing community as a tool for encoding images and video. It has been shown to 
have desirable properties for representing video, many of which also make it a promising object representation for visual tracking in 
video: 

• As illustrated in Fig. 1, the DCT leads to a compact object representation with sparse transform coefficients if a signal is 
self-correlated in both spatial and temporal dimensions. This means that the reconstruction error induced by removing a subset of 
coefficients is typically small. Additionally, high-frequency image noise or rapid appearance changes are often isolated in a small 
number of coefficients; 

• The DCT's cosine basis functions are determined by the signal dimensions that are fixed at initialization. Thus, the DCT's cosine 
basis functions are fixed throughout tracking, resulting in a simple procedure of constructing the DCT-based object representation; 

• The DCT only requires single-level cosine decomposition to approximate the original signal, which again is computationally 
efficient and also lends itself to incremental calculation, which is useful for tracking. 

Our idea is simply to represent a new sample by concatenating it with a collection of previous samples to form a 3D signal, and 
calculating its coefficients in the 3D-DCT space with some high-frequency components removed. Since the 3D-DCT encodes the 
temporal redundancy information of the 3D signal, the representation can capture the correlation between the new sample and the 
previous samples. Given a compression ratio (derived from discarding some high-frequency components), if the new sample can still 
be effectively reconstructed with a relatively low reconstruction error, then it is correlated with the previous samples and is likely to 
be an object sample. The fact that every sample is represented by using the same cosine basis functions makes it very easy to perform 
the likelihood evaluations of samples. 

Fig. 1. Illustration of the 3D-DCT's compactness. The left part shows a face image sequence, and the right part displays the corresponding energy 
spectrum of the 3D-DCT. The right part shows clearly that the energy spectrum of the 3D-DCT is compact. 

The DCT is not the only choice for compact representations using data-independent bases; others include Fourier and wavelet basis 
functions, which are also widely used in signal processing. The coefficients of these basis functions are capable of capturing the energy 
information at different frequencies. For example, both sine and cosine basis functions are adopted by the discrete Fourier transform 
(DFT) to generate the amplitude and phase frequency spectrums; wavelet basis functions (e.g., Haar and Gabor) aim to capture local 
detailed information (e.g., texture) of a signal at multiple resolutions by the wavelet transform (WT). Although we do not conduct 
experiments with these functions in this work, they can be used in our framework with only minor modification. 

Using the 3D-DCT object representation, we propose a discriminative learning based tracker. The main contributions of this tracker 
are three-fold: 

1) We utilize the signal compression power of the 3D-DCT to construct a novel representation of a tracked object. The representation 
retains the dense low-frequency 3D-DCT coefficients, and discards the relatively sparse high-frequency 3D-DCT coefficients. 
Based on this compact representation, the signal reconstruction error (measuring the information loss from signal reconstruction) 
is used to evaluate the likelihood of a test sample belonging to the foreground object given a set of training samples. 

2) We propose an incremental 3D-DCT algorithm for efficiently updating the representation. The incremental algorithm decomposes 
3D-DCT into the successive operations of the 2D-DCT and 1D-DCT on the input video data, and it only needs to compute the 
2D-DCT for newly added frames (referred to in Equ. (18)) as well as the 1D-DCT along the third dimension, resulting in high 
computational efficiency. In particular, the cosine basis functions can be computed in advance, which significantly reduces the 
computational cost of the 3D-DCT. 

3) We design a discriminative criterion (referred to in Equ. (20)) for predicting the confidence score of a test sample belonging to 
the foreground object. The discriminative criterion considers both the foreground and the background 3D-DCT reconstruction 
likelihoods, which enables the tracker to capture useful discriminative information for adapting to complicated appearance 
changes. 

II. Related work 

Since our work focuses on learning compact object representations based on the 3D-DCT, we first discuss the DCT and its applications 
in relevant research fields. Then, we briefly review the related tracking algorithms using different types of object representations. As 
claimed in [3], [4], the DCT aims to use a set of mutually uncorrelated cosine basis functions to express a discrete signal in a linear 
manner. It has a wide range of applications in computer vision, pattern recognition, and multimedia, such as face recognition [5], 
image retrieval [6], [7], video object segmentation [8], video caption localization [9], etc. In these applications, the DCT is typically 
used for feature extraction, and aims to construct a compact DCT coefficient-based image representation that is robust to complicated 
factors (e.g., facial geometry and illumination changes). In this paper, we focus on how to construct an effective DCT-based object 
representation for robust visual tracking. 

In the field of visual tracking, researchers have designed a variety of object representations, which can be roughly classified into 
two categories: generative object representations and discriminative object representations. 

Recently, much work has been done in constructing generative object representations, including the integral histogram [10], kernel 
density estimation [11], mixture models [12], [13], subspace learning [1], [14], linear representation [15], [16], [17], [18], [19], 
visual tracking decomposition [20], covariance tracking [21], [2], [22], and so on. Some representative tracking algorithms based on 
generative object representations are reviewed as follows. Jepson et al. [13] design a more elaborate mixture model with an online EM 
algorithm to explicitly model appearance changes during tracking. Wang et al. [12] present an adaptive appearance model based on the 
Gaussian mixture model in a joint spatial-color space. Comaniciu et al. [23] propose a kernel-based tracking algorithm using the mean 
shift-based mode seeking procedure. Following the work of [23], some variants of the kernel-based tracking algorithm are proposed, 
e.g., [11], [24], [25]. Ross et al. [1] propose a generalized tracking framework based on the incremental PCA (principal component 
analysis) subspace learning method with a sample mean update. A sparse approximation based tracking algorithm using $\ell_1$-regularized 
minimization is proposed by Mei and Ling [15]. To achieve real-time performance, Li et al. [18] present a compressive sensing $\ell_1$ 
tracker using an orthogonal matching pursuit algorithm, which is up to 6000 times faster than [15]. 

In contrast, another type of tracking algorithm tries to construct a variety of discriminative object representations, which aim to 
maximize the inter-class separability between the object and non-object regions using discriminative learning techniques, including 
SVMs [26], [27], [28], [29], boosting [30], [31], discriminative feature selection [32], random forest [33], multiple instance learning [34], 
spatial attention learning [35], discriminative metric learning [36], [37], data-driven adaptation [38], etc. Some popular tracking 
algorithms based on discriminative object representations are described as follows. Grabner et al. [30] design an online AdaBoost 
classifier for discriminative feature selection during tracking, resulting in robustness to the appearance variations caused by 
out-of-plane rotations and illumination changes. To alleviate the model drifting problem with [30], Grabner et al. [31] present a 
semi-supervised online boosting algorithm for tracking. Liu and Yu [39] present a gradient-based feature selection mechanism for 
online boosting learning, leading to higher tracking efficiency. Avidan [40] builds an ensemble of online learned weak classifiers 
for pixel-wise classification, and then employs mean shift for object localization. Instead of using single-instance boosting, Babenko et 
al. [34] present a tracking system based on online multiple instance boosting, where an object is represented as a set of image patches. 
Besides, SVM-based object representations have also attracted much attention in recent years. Based on off-line SVM learning, Avidan 
[26] proposes a tracking algorithm for distinguishing a target vehicle from backgrounds. Later, Tian et al. [27] present a tracking 
system based on an ensemble of linear SVM classifiers, which can be adaptively weighted according to their discriminative abilities 
during different periods. Instead of using supervised learning, Tang et al. [28] present an online semi-supervised learning based tracker, 
which constructs two feature-specific SVM classifiers in a co-training framework. 

As our tracking algorithm is based on the DCT, we give a brief review of the discrete cosine transform and its three basic versions 
for 1D, 2D, and 3D signals in the next section. 

III. The 3D-DCT for object representation 

We first give an introduction to the 3D-DCT in Section III-A. Then, we derive and formulate the DCT's matrix forms (used for 
object representation) in Section III-B. Next, we address the problem of how to use the 3D-DCT as a compact object representation 
in Section III-C. Finally, we propose an incremental 3D-DCT algorithm to efficiently compute the 3D-DCT in Section III-D. 

A. 3D-DCT definitions and notations 

The goal of the discrete cosine transform (DCT) is to express a discrete signal, such as a digital image or video, as a linear 
combination of mutually uncorrelated cosine basis functions (CBFs), each of which encodes frequency-specific information of the 
discrete signal. 

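We briefly define the 1D-DCT, 2D-DCT, and 3D-DCT, which are applied to a 1D signal $(f_{\mathrm{I}}(x))_{N_1}$, a 2D signal $(f_{\mathrm{II}}(x, y))_{N_1 \times N_2}$, 
and a 3D signal $(f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_3}$, respectively: 

$$C_{\mathrm{I}}(u) = \alpha_1(u) \sum_{x=0}^{N_1-1} f_{\mathrm{I}}(x) \cos\left[\frac{\pi(2x+1)u}{2N_1}\right], \tag{1}$$

$$C_{\mathrm{II}}(u, v) = \alpha_1(u)\,\alpha_2(v) \sum_{x=0}^{N_1-1} \sum_{y=0}^{N_2-1} f_{\mathrm{II}}(x, y) \cos\left[\frac{\pi(2x+1)u}{2N_1}\right] \cos\left[\frac{\pi(2y+1)v}{2N_2}\right], \tag{2}$$

$$C_{\mathrm{III}}(u, v, w) = \alpha_1(u)\,\alpha_2(v)\,\alpha_3(w) \sum_{x=0}^{N_1-1} \sum_{y=0}^{N_2-1} \sum_{z=0}^{N_3-1} f_{\mathrm{III}}(x, y, z) \cos\left[\frac{\pi(2x+1)u}{2N_1}\right] \cos\left[\frac{\pi(2y+1)v}{2N_2}\right] \cos\left[\frac{\pi(2z+1)w}{2N_3}\right], \tag{3}$$

where $u \in \{0, \ldots, N_1-1\}$, $v \in \{0, \ldots, N_2-1\}$, $w \in \{0, \ldots, N_3-1\}$, and 

$$\alpha_k(u) = \begin{cases} \sqrt{1/N_k}, & \text{if } u = 0; \\ \sqrt{2/N_k}, & \text{otherwise}, \end{cases} \tag{4}$$

where $k$ is a positive integer. 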

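The corresponding inverse DCTs (referred to as 1D-IDCT, 2D-IDCT, and 3D-IDCT) are defined as: 

$$f_{\mathrm{I}}(x) = \sum_{u=0}^{N_1-1} \alpha_1(u)\, C_{\mathrm{I}}(u) \cos\left[\frac{\pi(2x+1)u}{2N_1}\right], \tag{5}$$

$$f_{\mathrm{II}}(x, y) = \sum_{u=0}^{N_1-1} \sum_{v=0}^{N_2-1} \alpha_1(u)\,\alpha_2(v)\, C_{\mathrm{II}}(u, v) \cos\left[\frac{\pi(2x+1)u}{2N_1}\right] \cos\left[\frac{\pi(2y+1)v}{2N_2}\right], \tag{6}$$

$$f_{\mathrm{III}}(x, y, z) = \sum_{u=0}^{N_1-1} \sum_{v=0}^{N_2-1} \sum_{w=0}^{N_3-1} \alpha_1(u)\,\alpha_2(v)\,\alpha_3(w)\, C_{\mathrm{III}}(u, v, w) \cos\left[\frac{\pi(2x+1)u}{2N_1}\right] \cos\left[\frac{\pi(2y+1)v}{2N_2}\right] \cos\left[\frac{\pi(2z+1)w}{2N_3}\right]. \tag{7}$$

The low-frequency CBFs reflect the larger-scale energy information (e.g., mean value) of the discrete signal, while the high-frequency 
CBFs capture the smaller-scale energy information (e.g., texture) of the discrete signal. Based on these CBFs, the original discrete 
signal can be transformed into a DCT coefficient space whose dimensions are mutually uncorrelated. Furthermore, the output of the 
DCT is typically sparse, which is useful for signal compression and also for tracking, as will be shown in the following sections. 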



B. 3D-DCT matrix formulation 

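Let $\mathbf{C}_{\mathrm{I}} = (C_{\mathrm{I}}(0), C_{\mathrm{I}}(1), \ldots, C_{\mathrm{I}}(N_1 - 1))^{T}$ denote the 1D-DCT coefficient column vector. Based on Equ. (1), $\mathbf{C}_{\mathrm{I}}$ can be rewritten 
in a matrix form: $\mathbf{C}_{\mathrm{I}} = \mathbf{A}_1 \mathbf{f}$, where $\mathbf{f}$ is a column vector: $\mathbf{f} = (f_{\mathrm{I}}(0), f_{\mathrm{I}}(1), \ldots, f_{\mathrm{I}}(N_1 - 1))^{T}$ and $\mathbf{A}_1 = (a_1(u, x))_{N_1 \times N_1}$ is a 
cosine basis matrix whose entries are given by: 

$$a_1(u, x) = \alpha_1(u) \cos\left[\frac{\pi(2x+1)u}{2N_1}\right]. \tag{8}$$

The matrix form of the 1D-IDCT can be written as: $\mathbf{f} = \mathbf{A}_1^{-1} \mathbf{C}_{\mathrm{I}}$. Since $\mathbf{A}_1$ is an orthonormal matrix, $\mathbf{f} = \mathbf{A}_1^{T} \mathbf{C}_{\mathrm{I}}$. 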
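
As a minimal numerical sketch of the 1D matrix form (not the authors' implementation), the following Python snippet builds the cosine basis matrix of Equ. (8) with NumPy and applies it to a toy signal. The function name `dct_matrix` and the example signal are illustrative assumptions. Because the matrix is orthonormal, the inverse transform is simply its transpose. 

```python
import numpy as np

def dct_matrix(n):
    """Cosine basis matrix with entries a(u, x) = alpha(u) * cos(pi * (2x + 1) * u / (2n))."""
    u = np.arange(n).reshape(-1, 1)  # frequency index, one per row
    x = np.arange(n).reshape(1, -1)  # sample index, one per column
    a = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * u / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)       # alpha(0) = sqrt(1/n) for the constant basis function
    return a

f = np.array([1.0, 2.0, 3.0, 4.0])   # toy 1D signal (assumed example data)
A1 = dct_matrix(f.size)
C = A1 @ f                           # 1D-DCT:  C_I = A_1 f
f_rec = A1.T @ C                     # 1D-IDCT: f = A_1^T C_I (A_1 is orthonormal)
```

The first coefficient equals the signal sum scaled by $\sqrt{1/N_1}$, which is the sense in which the lowest-frequency CBF encodes the mean value. 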

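The 2D-DCT coefficient matrix $\mathbf{C}_{\mathrm{II}} = (C_{\mathrm{II}}(u, v))_{N_1 \times N_2}$ corresponding to Equ. (2) is formulated as: $\mathbf{C}_{\mathrm{II}} = \mathbf{A}_1 \mathbf{F} \mathbf{A}_2^{T}$, where 
$\mathbf{F} = (f_{\mathrm{II}}(x, y))_{N_1 \times N_2}$ is the original 2D signal, $\mathbf{A}_1$ is defined in Equ. (8), and $\mathbf{A}_2$ is defined as $(a_2(v, y))_{N_2 \times N_2}$ such that 

$$a_2(v, y) = \alpha_2(v) \cos\left[\frac{\pi(2y+1)v}{2N_2}\right]. \tag{9}$$

The matrix form of the 2D-IDCT can be expressed as: $\mathbf{F} = \mathbf{A}_1^{-1} \mathbf{C}_{\mathrm{II}} (\mathbf{A}_2^{T})^{-1}$. Since the DCT basis functions are orthonormal, we 
have $\mathbf{F} = \mathbf{A}_1^{T} \mathbf{C}_{\mathrm{II}} \mathbf{A}_2$. 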
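
The separable 2D matrix form can be checked in the same way. The sketch below is an illustration under assumed names (`dct_matrix`, a random 4 x 6 test image) and is not taken from the paper's code. It computes the 2D-DCT as $\mathbf{A}_1 \mathbf{F} \mathbf{A}_2^{T}$ and inverts it with the transposed basis matrices. 

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal cosine basis matrix, as in Equ. (8) and (9)."""
    u = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    a = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * u / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)
    return a

rng = np.random.default_rng(0)
F = rng.random((4, 6))               # toy 2D signal with N1 = 4, N2 = 6 (assumed data)
A1, A2 = dct_matrix(4), dct_matrix(6)

C2 = A1 @ F @ A2.T                   # 2D-DCT:  C_II = A_1 F A_2^T
F_rec = A1.T @ C2 @ A2               # 2D-IDCT: F = A_1^T C_II A_2
```

Since both basis matrices are orthonormal, the transform preserves signal energy, so discarding coefficients removes exactly the energy those coefficients carry. 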

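Similarly, the 3D-DCT can be decomposed into a succession of the 2D-DCT and 1D-DCT operations. Let $\mathcal{F} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_3}$ 
denote a 3D signal. Mathematically, $\mathcal{F}$ can be viewed as a three-order tensor, i.e., $\mathcal{F} \in \mathbb{R}^{N_1 \times N_2 \times N_3}$. Consequently, we need to 
introduce terminology for the mode-$m$ product defined in tensor algebra [41]. Let $\mathcal{B} \in \mathbb{R}^{I_1 \times I_2 \times \cdots \times I_M}$ denote an $M$-order tensor, 
each element of which is represented as $b(i_1, \ldots, i_m, \ldots, i_M)$ with $1 \le i_m \le I_m$. In tensor terminology, each dimension of a tensor 
is associated with a "mode". The mode-$m$ product of the tensor $\mathcal{B}$ by a matrix $\Phi = (\phi(j_m, i_m))_{J_m \times I_m}$ is denoted as $\mathcal{B} \times_m \Phi$, whose 
entries are as follows: 

$$(\mathcal{B} \times_m \Phi)(i_1, \ldots, i_{m-1}, j_m, i_{m+1}, \ldots, i_M) = \sum_{i_m=1}^{I_m} b(i_1, \ldots, i_m, \ldots, i_M)\, \phi(j_m, i_m), \tag{10}$$

where $\times_m$ is the mode-$m$ product operator and $1 \le m \le M$. Given two matrices $\mathbf{G} \in \mathbb{R}^{J_m \times I_m}$ and $\mathbf{H} \in \mathbb{R}^{J_n \times I_n}$ such that $m \ne n$, 
the following relation holds: 

$$(\mathcal{B} \times_m \mathbf{G}) \times_n \mathbf{H} = (\mathcal{B} \times_n \mathbf{H}) \times_m \mathbf{G} = \mathcal{B} \times_m \mathbf{G} \times_n \mathbf{H}. \tag{11}$$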
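
To make the mode-$m$ product concrete, the following sketch implements Equ. (10) with `numpy.tensordot` and checks the commutativity relation of Equ. (11) on random data. The helper name `mode_product` is an assumption, and modes are 0-based in the code (mode 0 in the code is mode 1 in the text). 

```python
import numpy as np

def mode_product(tensor, matrix, mode):
    """Mode-m product: contract axis `mode` of `tensor` with the columns of `matrix`."""
    # tensordot places the new axis (rows of `matrix`) first ...
    out = np.tensordot(matrix, tensor, axes=(1, mode))
    # ... so move it back to the position of the contracted axis.
    return np.moveaxis(out, 0, mode)

rng = np.random.default_rng(0)
B = rng.random((2, 3, 4))            # a 3-order tensor (assumed toy data)
G = rng.random((5, 3))               # acts on the axis of length 3
H = rng.random((6, 4))               # acts on the axis of length 4

left = mode_product(mode_product(B, G, 1), H, 2)
right = mode_product(mode_product(B, H, 2), G, 1)

# Direct evaluation of one entry of B x_m G from the sum in Equ. (10).
entry = sum(B[1, i, 2] * G[4, i] for i in range(3))
```

Both orders of application give a tensor of shape (2, 5, 6), which is why the chained notation without parentheses is unambiguous. 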



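Based on the above tensor algebra, the 3D-DCT coefficient matrix $\mathcal{C}_{\mathrm{III}} = (C_{\mathrm{III}}(u, v, w))_{N_1 \times N_2 \times N_3}$ can be formulated as: $\mathcal{C}_{\mathrm{III}} = 
\mathcal{F} \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2 \times_3 \mathbf{A}_3$, where $\mathbf{A}_3 = (a_3(w, z))_{N_3 \times N_3}$ has a similar definition to $\mathbf{A}_1$ and $\mathbf{A}_2$: 

$$a_3(w, z) = \alpha_3(w) \cos\left[\frac{\pi(2z+1)w}{2N_3}\right]. \tag{12}$$

Accordingly, the 3D-IDCT is formulated as: $\mathcal{F} = \mathcal{C}_{\mathrm{III}} \times_1 \mathbf{A}_1^{-1} \times_2 \mathbf{A}_2^{-1} \times_3 \mathbf{A}_3^{-1}$. Since $\mathbf{A}_k$ $(1 \le k \le 3)$ is an orthonormal matrix, $\mathcal{F}$ 
can be rewritten as: 

$$\mathcal{F} = \mathcal{C}_{\mathrm{III}} \times_1 \mathbf{A}_1^{T} \times_2 \mathbf{A}_2^{T} \times_3 \mathbf{A}_3^{T}. \tag{13}$$

In fact, the 1D-DCT and 2D-DCT are two special cases of the 3D-DCT because 1D vectors and 2D matrices are 1-order and 2-order 
tensors, respectively, namely, $\mathbf{f} \times_1 \mathbf{A}_1 = \mathbf{A}_1 \mathbf{f}$ and $\mathbf{F} \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2 = \mathbf{A}_1 \mathbf{F} \mathbf{A}_2^{T}$. 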
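
The tensor form of the 3D-DCT and its inverse in Equ. (13) can be sketched with `numpy.einsum`, which applies the three mode products in one call. The function names `dct3` and `idct3` and the random test volume are assumptions made for illustration. Dense basis matrices are used here for clarity only. 

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal cosine basis matrix, as in Equ. (8), (9), and (12)."""
    u = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    a = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * u / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)
    return a

def dct3(F):
    """3D-DCT: C_III = F x_1 A_1 x_2 A_2 x_3 A_3."""
    A1, A2, A3 = (dct_matrix(n) for n in F.shape)
    return np.einsum('xyz,ux,vy,wz->uvw', F, A1, A2, A3)

def idct3(C):
    """3D-IDCT: F = C_III x_1 A_1^T x_2 A_2^T x_3 A_3^T."""
    A1, A2, A3 = (dct_matrix(n) for n in C.shape)
    return np.einsum('uvw,ux,vy,wz->xyz', C, A1, A2, A3)

rng = np.random.default_rng(0)
F = rng.random((4, 5, 3))            # toy 3D signal (assumed data)
C = dct3(F)
F_rec = idct3(C)
```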



C. Compact object representation using the 3D-DCT 

For visual tracking, an input video sequence can be viewed as 3D data, so the 3D-DCT is a natural choice for object representation. 
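Given a sequence of normalized object image regions $\mathcal{F} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_3}$ from previous frames and a candidate image 
region $(\tau(x, y))_{N_1 \times N_2}$ in the current frame, we have a new image sequence $\mathcal{F}' = (f'_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times (N_3+1)}$, where the first $N_3$ 
images correspond to $\mathcal{F}$ and the last image (i.e., the $(N_3 + 1)$th image) is $(\tau(x, y))_{N_1 \times N_2}$. According to Equ. (13), $\mathcal{F}'$ can be 
expressed as: 

$$\mathcal{F}' = \mathcal{C}'_{\mathrm{III}} \times_1 \mathbf{A}_1^{T} \times_2 \mathbf{A}_2^{T} \times_3 (\mathbf{A}'_3)^{T}, \tag{14}$$

where $\mathcal{C}'_{\mathrm{III}} \in \mathbb{R}^{N_1 \times N_2 \times (N_3+1)}$ is the 3D-DCT coefficient matrix: $\mathcal{C}'_{\mathrm{III}} = \mathcal{F}' \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2 \times_3 \mathbf{A}'_3$ and $\mathbf{A}'_3 \in \mathbb{R}^{(N_3+1) \times (N_3+1)}$ is 
a cosine basis matrix whose entry is defined as: 

$$a'_3(w, z) = \begin{cases} \sqrt{\dfrac{1}{N_3+1}}, & \text{if } w = 0; \\[2mm] \sqrt{\dfrac{2}{N_3+1}} \cos\left[\dfrac{\pi(2z+1)w}{2(N_3+1)}\right], & \text{otherwise}. \end{cases} \tag{15}$$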

According to the properties of the 3D-DCT, the larger the values of $(u, v, w)$ are, the higher frequency the corresponding elements of 
$\mathcal{C}'_{\mathrm{III}}$ encode. Usually, the high-frequency coefficients are sparse while the low-frequency coefficients are relatively dense. Recently, PCA 
(principal component analysis) tracking [1] builds a compact subspace model which maintains a set of principal eigenvectors controlling 
the degree of structural information preservation. Inspired by PCA tracking [1], we compress the 3D-DCT object representation by 
retaining the relatively low-frequency elements of $\mathcal{C}'_{\mathrm{III}}$ around the origin, i.e., $\{(u, v, w) \mid u \le \delta_u, v \le \delta_v, w \le \delta_w\}$. As a result, we 
can obtain a compact 3D-DCT coefficient matrix $\mathcal{C}^{*}_{\mathrm{III}}$. Then, $\mathcal{F}'$ can be approximated by: 

$$\mathcal{F}' \approx \mathcal{F}^{*} = \mathcal{C}^{*}_{\mathrm{III}} \times_1 \mathbf{A}_1^{T} \times_2 \mathbf{A}_2^{T} \times_3 (\mathbf{A}'_3)^{T}. \tag{16}$$

Let $\mathcal{F}^{*} = (f^{*}_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times (N_3+1)}$ denote the corresponding reconstructed image sequence of $\mathcal{F}'$. The loss of high-frequency 
components introduces a reconstruction error $\|\tau - f^{*}_{\mathrm{III}}(:, :, N_3 + 1)\|$, which forms the basis of the likelihood measure, as shown in 
Section IV-B. 

Algorithm 1: Incremental 3D-DCT for object representation. 
Input: 

• Cosine basis matrices $\mathbf{A}_1$ and $\mathbf{A}_2$ (whose values are fixed given $N_1$ and $N_2$) 

• Cosine basis matrix $\mathbf{A}'_3$ 

• New image $(\tau(x, y))_{N_1 \times N_2}$ 

• $\mathcal{D} = \mathcal{F} \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2$ of the previous image sequence $\mathcal{F} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_3}$ 
begin 

1) Use the FFT to efficiently compute the 2D-DCT of $\tau$; 

2) Update $\mathcal{D}$ according to Equ. (18); 

3) Employ the FFT to efficiently obtain the 1D-DCT of the updated $\mathcal{D}$ along the third dimension. 
Output: 

• 3D-DCT (i.e., $\mathcal{C}'_{\mathrm{III}}$) of the current image sequence $\mathcal{F}' = (f'_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times (N_3+1)}$ 

Fig. 2. Comparison of the computational time between the normal 3D-DCT and our incremental 3D-DCT. The three subfigures correspond to different 
configurations of $N_1 \times N_2$ (i.e., 30 x 30, 60 x 60, and 90 x 90). In each subfigure, the x-axis is associated with $N_3$; the y-axis corresponds to the 
computational time. Clearly, as $N_3$ increases, the computational time of the normal 3D-DCT grows much faster than that of the incremental 3D-DCT. 
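
The truncation and reconstruction of Equ. (16) can be sketched numerically as follows. This is an illustration under stated assumptions rather than the paper's code: the function name, the synthetic 8 x 8 frames, and the numbers of retained coefficients `(4, 4, 2)` are all hypothetical choices. The sketch appends a candidate to a short sequence, keeps only the low-frequency block of the 3D-DCT coefficients, and measures the reconstruction error on the candidate frame. 

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal cosine basis matrix, as in Equ. (8)."""
    u = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    a = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * u / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)
    return a

def compact_reconstruction_error(frames, tau, keep):
    """Append tau, keep the low-frequency coefficient block, return the error on the tau frame."""
    seq = np.concatenate([frames, tau[:, :, None]], axis=2)      # F' with N3 + 1 frames
    A1, A2, A3 = (dct_matrix(n) for n in seq.shape)
    C = np.einsum('xyz,ux,vy,wz->uvw', seq, A1, A2, A3)          # 3D-DCT coefficients
    C_star = np.zeros_like(C)
    ku, kv, kw = keep                                            # retained coefficients per axis
    C_star[:ku, :kv, :kw] = C[:ku, :kv, :kw]                     # discard high frequencies
    rec = np.einsum('uvw,ux,vy,wz->xyz', C_star, A1, A2, A3)     # reconstruction as in Equ. (16)
    return np.linalg.norm(tau - rec[:, :, -1])

rng = np.random.default_rng(0)
g = np.cos(np.pi * (2 * np.arange(8) + 1) / 16.0)
base = np.outer(g, g)                                            # smooth synthetic "object"
frames = np.repeat(base[:, :, None], 5, axis=2)                  # five similar training frames

similar = base + 0.01 * rng.standard_normal((8, 8))              # candidate close to the object
different = rng.random((8, 8))                                   # unrelated candidate
err_similar = compact_reconstruction_error(frames, similar, (4, 4, 2))
err_different = compact_reconstruction_error(frames, different, (4, 4, 2))
```

In this toy setting the candidate that resembles the training frames is reconstructed with a much smaller error than the unrelated one, which is the property the likelihood measure relies on. 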

D. Incremental 3D-DCT 

Given a sequence of training images, we have shown how to use the 3D-DCT to represent an object for visual tracking, in Equ. (16). 
As the object's appearance changes with time, it is also necessary to update the object representation. Consequently, we propose an 
incremental 3D-DCT algorithm which can efficiently update the 3D-DCT based object representation as new data arrive. 

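Given a new image $(\tau(x, y))_{N_1 \times N_2}$ and the transform coefficient matrix $\mathcal{D} = \mathcal{F} \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2 \in \mathbb{R}^{N_1 \times N_2 \times N_3}$ of previous 
images $\mathcal{F} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_3}$, the incremental 3D-DCT algorithm aims to efficiently compute the 3D-DCT coefficient matrix 
$\mathcal{C}'_{\mathrm{III}} \in \mathbb{R}^{N_1 \times N_2 \times (N_3+1)}$ of the previous images with the current image appended: $\mathcal{F}' = (f'_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times (N_3+1)}$ with the last 
image being $(\tau(x, y))_{N_1 \times N_2}$. Mathematically, $\mathcal{C}'_{\mathrm{III}}$ is formulated as: 

$$\mathcal{C}'_{\mathrm{III}} = \mathcal{F}' \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2 \times_3 \mathbf{A}'_3, \tag{17}$$

where $\mathbf{A}'_3 \in \mathbb{R}^{(N_3+1) \times (N_3+1)}$ is referred to in Equ. (15). In principle, Equ. (17) can be computed in the following two stages: 1) 
compute the 2D-DCT coefficients for each image, i.e., $\mathcal{D}' = \mathcal{F}' \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2$; and 2) calculate the 1D-DCT coefficients along the 
time dimension, i.e., $\mathcal{C}'_{\mathrm{III}} = \mathcal{D}' \times_3 \mathbf{A}'_3$. 

According to the definition of the 3D-DCT, the CBF matrices $\mathbf{A}_1$ and $\mathbf{A}_2$ only depend on the row and column dimensions (i.e., $N_1$ 
and $N_2$), respectively. Since both $N_1$ and $N_2$ are unchanged during visual tracking, both $\mathbf{A}_1$ and $\mathbf{A}_2$ remain constant. In addition, 
$\mathcal{F}'$ is a concatenation of $\mathcal{F}$ and $(\tau(x, y))_{N_1 \times N_2}$ along the third dimension. According to the property of tensor algebra, $\mathcal{D}'$ can be 
decomposed as: 

$$\mathcal{D}'(:, :, k) = \begin{cases} \mathcal{D}(:, :, k), & \text{if } 1 \le k \le N_3; \\ \tau \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2, & \text{if } k = N_3 + 1. \end{cases} \tag{18}$$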

Given $\mathcal{D}$, $\mathcal{D}'$ can be efficiently updated by only computing the term $\tau \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2$. Moreover, $\mathbf{A}'_3$ is only dependent on the variable 
$N_3$. Once $N_3$ is fixed, $\mathbf{A}'_3$ is also fixed. In addition, $\tau \times_1 \mathbf{A}_1 \times_2 \mathbf{A}_2$ can be viewed as the 2D-DCT along the first two dimensions 
(i.e., x and y); and $\mathcal{C}'_{\mathrm{III}} = \mathcal{D}' \times_3 \mathbf{A}'_3$ can be viewed as the 1D-DCT along the time dimension. To further reduce the computational 
time of the 1D-DCT and 2D-DCT, we employ a fast algorithm using the Fast Fourier Transform (FFT) to efficiently compute the 
DCT and its inverse [3], [4]. The complete procedure of the incremental 3D-DCT algorithm is summarized in Algorithm 1. 

The complexity of our incremental algorithm is $O(N_1 N_2 (\log N_1 + \log N_2) + N_1 N_2 N_3 \log N_3)$ at each frame. In contrast, using a 
traditional batch-mode strategy for DCT computation, the complexity of the normal 3D-DCT algorithm becomes $O(N_1 N_2 N_3 (\log N_1 + 
\log N_2 + \log N_3))$. To illustrate the computational efficiency of the incremental 3D-DCT algorithm, Fig. 2 shows the computational 
time of the incremental 3D-DCT and normal 3D-DCT algorithms for different values of $N_1$, $N_2$, and $N_3$. Although the computation 
time of both algorithms increases with $N_3$, the growth rate of the incremental 3D-DCT algorithm is much lower. 

Algorithm 2: Incremental 3D-DCT object tracking. 

Input: New frame $t$, previous object state $Z^{*}_{t-1}$, previous positive and negative sample sets: $\mathcal{F}_{+} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_{+}}$ and 
$\mathcal{F}_{-} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_{-}}$, maximum buffer size $T$. 
Initialization: 

• $t = 1$. 

• Manually set the initial object state $Z^{*}_{t}$. 

• Collect positive (or negative) samples to form training sets $\mathcal{F}_{+} = Z_t^{+}$ and $\mathcal{F}_{-} = Z_t^{-}$ (see Section IV-A); 
begin 

• Sample $V$ candidate object states $\{Z_{tj}\}_{j=1}^{V}$ according to Equ. (21). 
• Crop out the corresponding image regions $\{o_{tj}\}_{j=1}^{V}$ of $\{Z_{tj}\}_{j=1}^{V}$. 
• Resize each candidate image region $o_{tj}$ to $N_1 \times N_2$ pixels. 
• for each $Z_{tj}$ do 

1) Find the $K$ nearest neighbors $\mathcal{F}_{+}^{K} \in \mathbb{R}^{N_1 \times N_2 \times K}$ (or $\mathcal{F}_{-}^{K} \in \mathbb{R}^{N_1 \times N_2 \times K}$) of a candidate 
sample $\tau$ (i.e., $\tau = o_{tj}$) from $\mathcal{F}_{+}$ (or $\mathcal{F}_{-}$). 

2) Obtain the 3D signals $\mathcal{F}'_{+}$ and $\mathcal{F}'_{-}$ through the concatenations of $(\mathcal{F}_{+}^{K}, \tau)$ and $(\mathcal{F}_{-}^{K}, \tau)$. 

3) Perform the incremental 3D-DCT in Algorithm 1 to compute the 3D-DCT coefficient matrices: $\mathcal{C}'_{\mathrm{III}+}$ and $\mathcal{C}'_{\mathrm{III}-}$. 

4) Compute the compact 3D-DCT coefficient matrices $\mathcal{C}^{*}_{\mathrm{III}+}$ and $\mathcal{C}^{*}_{\mathrm{III}-}$ by discarding the high-frequency coefficients of 
$\mathcal{C}'_{\mathrm{III}+}$ and $\mathcal{C}'_{\mathrm{III}-}$. 

5) Calculate the reconstructed representations of $\mathcal{F}'_{+}$ and $\mathcal{F}'_{-}$ as $\mathcal{F}^{*}_{+}$ and $\mathcal{F}^{*}_{-}$ by Equ. (16). 

6) Compute the reconstruction likelihoods $\mathcal{L}_{\mathcal{F}+}$ and $\mathcal{L}_{\mathcal{F}-}$ using Equ. (19). 

7) Calculate the final likelihood $\mathcal{L}^{*}_{\mathcal{F}}$ using Equ. (20). 

• Determine the optimal object state $Z^{*}_{t}$ by the MAP estimation (referred to in Equ. (22)). 
• Select positive (or negative) samples $Z_t^{+}$ (or $Z_t^{-}$) (referred to in Sec. IV-A). 
• Update the training sample sets $\mathcal{F}_{+}$ and $\mathcal{F}_{-}$ with $\mathcal{F}_{+} \cup Z_t^{+}$ and $\mathcal{F}_{-} \cup Z_t^{-}$. 
• $N_{+} = N_{+} + |Z_t^{+}|$ and $N_{-} = N_{-} + |Z_t^{-}|$. 
• Maintain the positive and negative sample sets as follows: 

- If $N_{+} > T$, then $\mathcal{F}_{+}$ is truncated to keep the last $T$ elements. 

- If $N_{-} > T$, then $\mathcal{F}_{-}$ is truncated to keep the last $T$ elements. 

Output: Current object state $Z^{*}_{t}$, updated positive and negative sample sets $\mathcal{F}_{+}$ and $\mathcal{F}_{-}$. 
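
The two-stage update of Equ. (17) and (18) can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses dense cosine basis matrices instead of the FFT-based fast DCT mentioned in the text, and the names `incremental_dct3`, `D`, and `tau` as well as the array sizes are assumptions. The sketch reuses the stored 2D-DCT slices, transforms only the new frame, and then applies the 1D-DCT along the third dimension. 

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal cosine basis matrix, as in Equ. (8) and (15)."""
    u = np.arange(n).reshape(-1, 1)
    x = np.arange(n).reshape(1, -1)
    a = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * x + 1) * u / (2 * n))
    a[0, :] = np.sqrt(1.0 / n)
    return a

def incremental_dct3(D, tau, A1, A2):
    """Append the 2D-DCT of tau to D (Equ. (18)), then take the 1D-DCT along the third axis."""
    new_slice = A1 @ tau @ A2.T                                  # 2D-DCT of the new frame only
    D_new = np.concatenate([D, new_slice[:, :, None]], axis=2)   # earlier slices are reused as-is
    A3 = dct_matrix(D_new.shape[2])                              # basis for N3 + 1 frames
    C = D_new @ A3.T                                             # mode-3 product D' x_3 A'_3
    return D_new, C

rng = np.random.default_rng(1)
N1, N2, N3 = 6, 5, 4
F = rng.random((N1, N2, N3))                                     # previous frames (toy data)
tau = rng.random((N1, N2))                                       # newly arrived frame
A1, A2 = dct_matrix(N1), dct_matrix(N2)

D = np.einsum('xyz,ux,vy->uvz', F, A1, A2)                       # stored 2D-DCT slices of F
D_new, C_inc = incremental_dct3(D, tau, A1, A2)

# Batch reference: full 3D-DCT of the concatenated sequence.
F_new = np.concatenate([F, tau[:, :, None]], axis=2)
C_batch = np.einsum('xyz,ux,vy,wz->uvw', F_new, A1, A2, dct_matrix(N3 + 1))
```

The incremental result matches the batch transform, while the per-frame work on the first two dimensions is limited to one 2D-DCT. 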

IV. Incremental 3D-DCT based tracking 

In this section, we propose a complete 3D-DCT based tracking algorithm, which is composed of three main modules: 

• training sample selection: select positive and negative samples for discriminative learning; 

• likelihood evaluation: compute the similarity between candidate samples and the 3D-DCT based observation model; 

• motion estimation: generate candidate samples and estimate the object state. 

Algorithm 2 lists the workflow of the proposed tracking algorithm. Next, we will discuss the three modules in detail. 

A. Training sample selection 

Similar to [34], we take a spatial distance-based strategy for training sample selection. Namely, the image regions from a small 
neighborhood around the object location are selected as positive samples, while the negative samples are generated by selecting the 
image regions which are relatively far from the object location. Specifically, we draw a number of samples $Z_t$ from Equ. (21), and 
then an ascending sort for the samples from $Z_t$ is made according to their spatial distances to the current object location, resulting 
in a sorted sample set $Z_t^{s}$. By selecting the first few samples from $Z_t^{s}$, we have a subset $Z_t^{+}$ that is the final positive sample set, as 
shown in the middle part of Fig. 3. The negative sample set $Z_t^{-}$ is generated in the area around the current tracker location, as shown 
in the right part of Fig. 3. 
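
A minimal sketch of this distance-based selection is given below. The text does not specify how far a region must be to count as negative, so the threshold `neg_min_dist`, the function name, and the toy states are assumptions introduced only for illustration. 

```python
import numpy as np

def select_training_samples(states, center, n_pos, neg_min_dist):
    """Split sampled states (x, y, scale) into positives and negatives by spatial distance."""
    dist = np.linalg.norm(states[:, :2] - np.asarray(center), axis=1)
    order = np.argsort(dist)                  # ascending sort by distance to the object location
    positives = states[order[:n_pos]]         # the first few (closest) samples
    negatives = states[dist > neg_min_dist]   # ASSUMPTION: "relatively far" means beyond a threshold
    return positives, negatives

# Toy sampled states: columns are x translation, y translation, and scale.
states = np.array([[10.0, 10.0, 1.0],
                   [11.0, 10.0, 1.0],
                   [10.0, 12.0, 1.0],
                   [30.0, 10.0, 1.0],
                   [10.0, 40.0, 1.0]])
pos, neg = select_training_samples(states, center=(10.0, 10.0), n_pos=2, neg_min_dist=15.0)
```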

B. Likelihood evaluation 

During tracking, each of the positive and negative samples is normalized to $N_1 \times N_2$ pixels. Without loss of generality, we assume 
the numbers of the positive and negative samples to be $N_{+}$ and $N_{-}$. The positive and negative sample sequences are denoted as 
$\mathcal{F}_{+} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_{+}}$ and $\mathcal{F}_{-} = (f_{\mathrm{III}}(x, y, z))_{N_1 \times N_2 \times N_{-}}$, respectively. Based on $\mathcal{F}_{+}$ and $\mathcal{F}_{-}$, we evaluate the likelihood 
of a candidate sample $(\tau(x, y))_{N_1 \times N_2}$ belonging to the foreground object. Since the appearance of $\mathcal{F}_{+}$ and $\mathcal{F}_{-}$ is likely to vary 
significantly as time progresses, it is not necessary for the 3D-DCT to use all samples in $\mathcal{F}_{+}$ and $\mathcal{F}_{-}$ to represent the candidate 
sample $(\tau(x, y))_{N_1 \times N_2}$. As pointed out by [42], locality is more essential than sparsity because locality usually results in sparsity 
but not necessarily vice versa. As a result, a locality-constrained strategy is taken to construct a compact object representation using 
the proposed incremental 3D-DCT algorithm. 

Fig. 3. Illustration of training sample selection (panels: original image, positive samples, negative samples). The left subfigure plots the bounding 
box corresponding to the current tracker location; the middle subfigure shows the selected positive samples; and the right subfigure displays the 
selected negative samples. Different colors are associated with different samples. 

Fig. 4. Illustration of the process of computing the reconstruction likelihood between test images and training images using the 3D-DCT and 3D-IDCT. 


Specifically, we first compute the $K$-nearest neighbors (referred to as $\mathcal{F}_{+}^{K} \in \mathbb{R}^{N_1 \times N_2 \times K}$ and $\mathcal{F}_{-}^{K} \in \mathbb{R}^{N_1 \times N_2 \times K}$) of the candidate 
sample $\tau$ from $\mathcal{F}_{+}$ and $\mathcal{F}_{-}$, sort them by their sum-squared distance to $\tau$ (as shown in the top-left part of Fig. 4), and then utilize the 
incremental 3D-DCT algorithm to construct the compact object representation. Let $\mathcal{F}'_{+}$ and $\mathcal{F}'_{-}$ denote the concatenations of $(\mathcal{F}_{+}^{K}, \tau)$ 
and $(\mathcal{F}_{-}^{K}, \tau)$, respectively. Through the incremental 3D-DCT algorithm, the corresponding 3D-DCT coefficient matrices $\mathcal{C}'_{\mathrm{III}+}$ and 
$\mathcal{C}'_{\mathrm{III}-}$ can be efficiently calculated. After discarding the high-frequency coefficients, we can obtain the corresponding compact 
3D-DCT coefficient matrices $\mathcal{C}^{*}_{\mathrm{III}+}$ and $\mathcal{C}^{*}_{\mathrm{III}-}$. Based on Equ. (16), the reconstructed representations of $\mathcal{F}'_{+}$ and $\mathcal{F}'_{-}$ are obtained as $\mathcal{F}^{*}_{+}$ 
and $\mathcal{F}^{*}_{-}$, respectively. We compute the following reconstruction likelihoods: 




where $\gamma_{+}$ and $\gamma_{-}$ are two scaling factors, and $f^{*}_{\mathrm{III}+}(:, :, K + 1)$ and $f^{*}_{\mathrm{III}-}(:, :, K + 1)$ are respectively the last images of $\mathcal{F}^{*}_{+}$ and $\mathcal{F}^{*}_{-}$. 
Figs. 4 and 5 illustrate the process of computing the reconstruction likelihood between test samples and training samples (i.e., car 
and face samples) using the 3D-DCT and 3D-IDCT. Based on $\mathcal{L}_{\mathcal{F}+}$ and $\mathcal{L}_{\mathcal{F}-}$, we define the final likelihood evaluation criterion: 

$$\mathcal{L}^{*}_{\mathcal{F}} = \rho(\mathcal{L}_{\mathcal{F}+} - \lambda \mathcal{L}_{\mathcal{F}-}), \tag{20}$$

where $\lambda$ is a weight factor and $\rho(x) = \frac{1}{1 + \exp(-x)}$ is the sigmoid function. 
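
The discriminative criterion of Equ. (20) can be sketched as below. The sigmoid combination follows Equ. (20) directly. The form of the reconstruction likelihood is an assumption: the sketch uses a Gaussian-shaped function of the squared reconstruction error with scaling factor `gamma`, which is one plausible choice and is not confirmed by the text above. All function names and numeric values are hypothetical. 

```python
import numpy as np

def reconstruction_likelihood(tau, recon_last, gamma):
    """ASSUMED form: Gaussian-shaped likelihood of the reconstruction error, scaled by gamma."""
    err_sq = np.sum((tau - recon_last) ** 2)
    return np.exp(-err_sq / (2.0 * gamma ** 2))

def final_likelihood(l_pos, l_neg, lam):
    """Equ. (20): sigmoid of the foreground likelihood minus the weighted background likelihood."""
    return 1.0 / (1.0 + np.exp(-(l_pos - lam * l_neg)))

tau = np.ones((4, 4))                 # toy candidate image
rec_pos = tau + 0.1                   # reconstruction from positive neighbors (small error)
rec_neg = tau + 1.0                   # reconstruction from negative neighbors (large error)

l_pos = reconstruction_likelihood(tau, rec_pos, gamma=1.0)
l_neg = reconstruction_likelihood(tau, rec_neg, gamma=1.0)
score = final_likelihood(l_pos, l_neg, lam=0.5)
score_swapped = final_likelihood(l_neg, l_pos, lam=0.5)
```

A candidate that is well reconstructed from the positive samples and poorly reconstructed from the negative samples receives a score above 0.5, while the opposite case receives a lower score. 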

To demonstrate the discriminative ability of the proposed 3D-DCT based observation model, we plot a confidence map defined in 
the entire image search space (shown in Fig. 6(a)). Each element of the confidence map is computed by measuring the likelihood 
score of the candidate bounding box centered at this pixel belonging to the learned observation model, according to Equ. (20). For 
better visualization, $\mathcal{L}^{*}_{\mathcal{F}}$ is normalized to [0, 1]. After calculating all the normalized likelihood scores at different locations, we have 
a confidence map which is shown in Fig. 6(b). From Fig. 6(b), we can see that the confidence map has an obvious uni-modal peak, 
which indicates that the proposed observation model has a good discriminative ability in this image. 



IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 







Fig. 5. Example of computing the likelihood scores between test images and training images. The left part shows the training image sequence; the 
top-right part displays the test images; the bottom-middle part exhibits the reconstructed images by 3D-DCT and 3D-IDCT; the bottom-right part plots 
the corresponding likelihood scores (computed by Equ. (19)). 



Fig. 6. Demonstration of the discriminative ability of the 3D-DCT based object representation used by our tracker. (a) shows the original frame; and 
(b) displays a confidence map, each element of which corresponds to an image patch in the entire image search space. 



C. Motion estimation 

The motion estimation module is based on a particle filter [43], which is a Markov model with hidden state variables. The particle
filter can be divided into the prediction and the update steps:

$$p(Z_t \mid O_{t-1}) \propto \int p(Z_t \mid Z_{t-1})\, p(Z_{t-1} \mid O_{t-1})\, dZ_{t-1},$$

$$p(Z_t \mid O_t) \propto p(o_t \mid Z_t)\, p(Z_t \mid O_{t-1}),$$

where $O_t = \{o_1, \ldots, o_t\}$ are observation variables, $p(o_t \mid Z_t)$ denotes the observation model, and $p(Z_t \mid Z_{t-1})$ represents the
state transition model. For the sake of computational efficiency, we only consider the motion information in translation and scaling.
Specifically, let $Z_t = (X_t, Y_t, S_t)$ denote the motion parameters including x translation, y translation, and scaling. The motion model
between two consecutive frames is assumed to be a Gaussian distribution:

$$p(Z_t \mid Z_{t-1}) = \mathcal{N}(Z_t; Z_{t-1}, \Sigma), \qquad (21)$$

where $\Sigma$ denotes a diagonal covariance matrix with diagonal elements $\sigma_X^2$, $\sigma_Y^2$, and $\sigma_S^2$. For each state $Z_t$, there is a corresponding
image region $o_t$ that is normalized to $N_1 \times N_2$ pixels by image scaling. The likelihood $p(o_t \mid Z_t)$ is defined as $p(o_t \mid Z_t) \propto \mathcal{L}^{*}$,
where $\mathcal{L}^{*}$ is defined in Equ. (20). Thus, the optimal object state $Z_t^{*}$ at time $t$ can be determined by solving the following maximum
a posteriori (MAP) problem:

$$Z_t^{*} = \arg\max_{Z_t}\, p(Z_t \mid O_t). \qquad (22)$$
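
The prediction, update, and MAP steps can be sketched with a small sampling-based filter. This is an illustrative sketch under stated assumptions, not the authors' implementation: the resampling step is the standard sampling-importance-resampling form (not described in the text), `toy_likelihood` is a hypothetical stand-in for the observation likelihood of Equ. (20), and the standard deviations passed as `sigmas` are made-up values.

```python
import math
import random

def propagate(state, sigmas, rng):
    """Draw Z_t ~ N(Z_{t-1}, Sigma) for a diagonal Sigma, as in Equ. (21)."""
    return tuple(rng.gauss(z, s) for z, s in zip(state, sigmas))

def particle_filter_step(particles, weights, likelihood, sigmas, rng):
    """One resample / predict / update cycle; returns particles, weights, MAP state."""
    n = len(particles)
    # Resample from the previous posterior (assumed step, standard in SIR filters).
    resampled = rng.choices(particles, weights=weights, k=n)
    # Prediction: push every particle through the Gaussian motion model.
    predicted = [propagate(z, sigmas, rng) for z in resampled]
    # Update: weight each particle by the observation likelihood p(o_t | Z_t).
    raw = [likelihood(z) for z in predicted]
    total = sum(raw)
    new_weights = [w / total for w in raw] if total > 0 else [1.0 / n] * n
    # MAP estimate over the particle set, as in Equ. (22).
    best = max(range(n), key=lambda i: new_weights[i])
    return predicted, new_weights, predicted[best]

def toy_likelihood(state, target=(50.0, 40.0, 1.0)):
    """Hypothetical stand-in for the likelihood: peaks when the state matches `target`."""
    x, y, s = state
    tx, ty, ts = target
    return math.exp(-((x - tx) ** 2 + (y - ty) ** 2) / 50.0 - (s - ts) ** 2 / 0.02)

rng = random.Random(0)
particles = [(45.0, 45.0, 1.0)] * 200   # 200 particles, as used in the experiments
weights = [1.0 / 200] * 200
for _ in range(10):
    particles, weights, z_map = particle_filter_step(
        particles, weights, toy_likelihood, sigmas=(4.0, 4.0, 0.02), rng=rng)
print(z_map)
```

Since the particles are resampled before prediction, their prior weights are uniform and the posterior weight of each particle is proportional to its likelihood alone, which is why the MAP state is simply the particle with the largest likelihood.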

V. EXPERIMENTS

A. Data description and implementation details 

We evaluate the performance of the proposed tracker (referred to as ITDT) on twenty video sequences, which are captured in different
scenes and composed of 8-bit grayscale images. In these video sequences, several complicating factors lead to drastic appearance changes
of the tracked objects, including illumination variation, occlusion, out-of-plane rotation, background distraction, small target size, motion
blurring, and pose variation. The experiments on these video sequences have two main goals: to verify the robustness of the proposed
ITDT in various challenging situations, and to evaluate the adaptive capability of ITDT in tolerating complicated appearance changes.

The proposed ITDT is implemented in Matlab on a workstation with an Intel Core 2 Duo 2.66 GHz processor and 3.24 GB of RAM.
The average running time of the proposed ITDT is about 0.8 seconds per frame. During tracking, the pixel values of each frame
are normalized into [0, 1]. For the sake of computational efficiency, we only consider the object state information in 2D translation
and scaling in the particle filtering module, where the particle number is set to 200. Each particle is associated with an image patch.
After image scaling, the image patch is normalized to $N_1 \times N_2$ pixels. In the experiments, the parameters $(N_1, N_2)$ are chosen as
(30, 30). The scaling factors $(\gamma_+, \gamma_-)$ in Equ. (19) are both set to 1.2. The weight factor $\lambda$ in Equ. (20) is set to 0.1. The number
of nearest neighbors $K$ in Algorithm 2 is chosen as 15. The parameter $T$ (i.e., maximum buffer size) in Algorithm 2 is set to 500.
These parameter settings remain the same throughout all the experiments in the paper. For user-defined tasks on other video
sequences, these parameter settings can be slightly readjusted to achieve better tracking performance.
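
For readers who want to reproduce the setup, the reported settings can be collected in a single configuration object. The class and field names below are hypothetical; only the numeric values come from the text above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ITDTParams:
    """Parameter values reported in the paper; names are illustrative only."""
    num_particles: int = 200       # particles in the particle filter
    patch_size: tuple = (30, 30)   # (N1, N2), normalized patch size in pixels
    gamma_pos: float = 1.2         # scaling factor gamma_+
    gamma_neg: float = 1.2         # scaling factor gamma_-
    lam: float = 0.1               # weight factor lambda in Equ. (20)
    num_neighbors: int = 15        # K nearest neighbors in Algorithm 2
    max_buffer_size: int = 500     # T, maximum buffer size in Algorithm 2

params = ITDTParams()
print(params)
```

Freezing the dataclass mirrors the statement that the settings stay fixed across all experiments; a per-sequence variant could be built with `dataclasses.replace`.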

B. Competing trackers 

We compare the proposed tracker with several other state-of-the-art trackers qualitatively and quantitatively. The competing trackers
are referred to as FragT (fragment-based tracker [10]; see footnote 1), MILT (multiple instance boosting-based tracker [33]; see footnote 2),
VTD (visual tracking decomposition [20]; see footnote 3), OAB (online AdaBoost [30]; see footnote 4), IPCA (incremental PCA [__]; see
footnote 5), and L1T ($\ell_1$ tracker [__]; see footnote 6). IPCA, VTD, and L1T make use of particle filters for state inference, while FragT,
MILT, and OAB utilize a sliding window search for state inference. We directly use the public source code of FragT, MILT, VTD, OAB,
IPCA, and L1T. In the experiments, OAB has two different versions, i.e., OAB1 and OAB5, which utilize two different positive sample
search radii (i.e., $r = 1$ and $r = 5$, selected in the same way as [34]) for learning AdaBoost classifiers.

We select these seven competing trackers for the following reasons. First, as a recently proposed discriminant learning-based
tracker, MILT takes advantage of multiple instance boosting for object/non-object classification. Based on the multi-instance object
representation, MILT is capable of capturing the inherent ambiguity of object localization. In contrast, OAB is based on online
single-instance boosting for object/non-object classification. The goal of comparing ITDT with MILT and OAB is to demonstrate the
discriminative capabilities of ITDT in handling large appearance variations. In addition, based on a fragment-based object representation,
FragT is capable of fully capturing the spatial layout information of the object region, which contributes to tracking robustness. Based on
incremental principal component analysis, IPCA constructs an eigenspace-based observation model for visual tracking. L1T converts
the problem of visual tracking to that of sparse approximation based on $\ell_1$-regularized minimization. As a recently proposed tracker,
VTD uses sparse principal component analysis to decompose the observation (or motion) model into a set of basic observation (or
motion) models, each of which covers a specific type of object appearance (or motion). Thus, comparing ITDT with FragT, IPCA,
L1T, and VTD can show their relative capabilities of tolerating complicated appearance changes.

C. Tracking results 

Due to the space limit, we only report tracking results for the eight trackers (highlighted by bounding boxes in different colors)
over representative frames of the first twelve video sequences, as shown in Figs. 7-18 (the caption of each figure includes the name
of its corresponding video sequence). Complete quantitative comparisons for all twenty video sequences can be found in Tab. I.

As shown in Fig. 7, a man walks under a trellis (see footnote 7). Suffering from large changes in environmental illumination and head pose,
VTD and OAB5 start to fail in tracking the face after the 170th frame, while OAB1, IPCA, MILT, and FragT break down after the
182nd, 201st, 202nd, and 205th frames, respectively. L1T fails to track the face from the 252nd frame. In contrast to these competing
trackers, the proposed ITDT is able to successfully track the face till the end of the video.

Fig. 8 shows that a tiger toy is shaken strongly (see footnote 8). Affected by drastic pose variation, illumination change, and partial occlusion, L1T,
IPCA, OAB5, and FragT fail in tracking the tiger toy after the 72nd, 114th, 154th, and 224th frames, respectively. From the 113th
frame, VTD fails to track the tiger toy intermittently. OAB1 does not lose the tiger toy, but it achieves inaccurate tracking
results. In contrast, both MILT and ITDT are capable of accurately tracking the tiger toy under illumination changes and
partial occlusions.

As shown in Fig. 9, a car moves quickly in a dark road scene with background clutter and varying lighting conditions (see footnote 9).
After the 271st frame, VTD fails to track the car due to illumination changes. Distracted by background clutter, MILT, FragT, L1T,
and OAB1 break down after the 196th, 208th, 286th, and 295th frames, respectively. OAB5 can keep tracking the car, but obtains
inaccurate tracking results. In contrast, only ITDT and IPCA succeed in accurately tracking the car throughout the video sequence.

1 http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm

2 http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml

3 http://cv.snu.ac.kr/research/~vtd/

4 http://www.vision.ee.ethz.ch/boostingTrackers/download.htm

5 http://www.cs.utoronto.ca/~dross/ivt/

6 http://www.ist.temple.edu/~hbling

7 Downloaded from http://www.cs.toronto.edu/~dross/ivt/

8 Downloaded from http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml

9 Downloaded from http://www.cs.toronto.edu/~dross/ivt/

Fig. 7. The tracking results of the eight trackers over the representative frames (i.e., the 197th, 237th, 275th, 295th, 311th, 347th, 376th, and 433rd
frames) of the "trellis70" video sequence in the scenarios with drastic illumination changes and head pose variations.

Fig. 8. The tracking results of the eight trackers over the representative frames (i.e., the 1st, 72nd, 146th, 285th, 291st, and 316th frames) of the
"tiger" video sequence in the scenarios with partial occlusion, illumination change, pose variation, and motion blurring.

Fig. 10 shows that several deer run and jump in a river (see footnote 10). Because of drastic pose variation and motion blurring, FragT fails in
tracking the head of a deer after the 5th frame, while IPCA, VTD, OAB1, and OAB5 lose the head of the deer after the 13th, 17th,
39th, and 52nd frames, respectively. L1T and MILT are incapable of accurately tracking the head of the deer all the time, and lose
the target intermittently. Compared with these trackers, the proposed ITDT is able to accurately track the head of the deer throughout
the video sequence.

In the video sequence shown in Fig. 11, several persons walk along a corridor (see footnote 11). One person is occluded severely by the other two
persons. All the competing trackers except for FragT and ITDT suffer from the severe occlusion taking place between the 56th frame and
the 76th frame. As a result, they completely fail to track the person after the 76th frame. In contrast, FragT and ITDT can track
the person successfully. However, FragT achieves less accurate tracking results than ITDT.

Fig. 12 shows that a woman with varying body poses walks along a pavement (see footnote 12). In the meantime, her body is occluded by several
cars. After the 127th frame, MILT, OAB1, IPCA, and VTD start to drift away from the woman as a result of partial occlusion. L1T
begins to lose the woman after the 147th frame, while OAB5 fails to track the woman from the 205th frame. From the 227th frame,
FragT stays far away from the woman. Only ITDT can keep tracking the woman over time.

10 Downloaded from http://cv.snu.ac.kr/research/~vtd/

11 Downloaded from http://homepages.inf.ed.ac.uk/rbf/caviardata1/

12 Downloaded from http://www.cs.technion.ac.il/~amita/fragtrack/fragtrack.htm

Fig. 9. The tracking results of the eight trackers over the representative frames (i.e., the 1st, 303rd, 335th, 362nd, 386th, and 388th frames) of the
"car11" video sequence in the scenarios with varying lighting conditions and background clutter.

Fig. 10. The tracking results of the eight trackers over the representative frames (i.e., the 1st, 40th, 43rd, 56th, 57th, and 59th frames) of the "animal"
video sequence in the scenarios with motion blurring and background distraction.

In the video sequence shown in Fig. 13, a number of soccer players assemble together and scream excitedly, jumping up and
down (see footnote 13). Moreover, their heads are partially occluded by many pieces of floating paper. FragT, IPCA, MILT, and OAB5 fail to track
the face from the 49th, 52nd, 49th, and 87th frames, respectively. From the 48th frame to the 94th frame, VTD and OAB1 track
the face unsuccessfully. After the 94th frame, they capture the location of the face again. Compared with these competing
trackers, the proposed ITDT achieves good performance throughout the video sequence.



In Fig. 14, several small-sized cars densely surrounded by other cars move in a blurry traffic scene (see footnote 14). Due to the influence of
background distraction and small target size, MILT, OAB5, FragT, OAB1, VTD, and L1T fail to track the car from the 69th, 160th, 190th,
196th, 246th, and 314th frames, respectively. In contrast, both ITDT and IPCA are able to locate the car accurately at all times.

As shown in Fig. 15, a driver tries to parallel park in the gap between two cars (see footnote 15). At the end of the video sequence, the car is
partially occluded by another car. FragT, VTD, OAB1, and IPCA achieve inaccurate tracking performances after the 122nd frame.
Subsequently, they begin to drift away after the 435th frame, while OAB5 begins to break down from the 486th frame. MILT and
L1T are able to track the car, but achieve inaccurate tracking results. In contrast to these competing trackers, the proposed ITDT is
able to perform accurate car tracking throughout the video.

13 Downloaded from http://cv.snu.ac.kr/research/~vtd/

14 Downloaded from http://i21www.ira.uka.de/image_sequences/

15 Downloaded from http://www.hitech-projects.com/euprojects/cantata/datasets_cantata/dataset.html

Fig. 16 shows that two balls are rolled on the floor. In the middle of the video sequence, one ball is occluded by the other ball.
L1T, FragT, and VTD fail in tracking the ball in the 3rd, 5th, and 6th frames, respectively. Before the 8th frame, OAB1, OAB5, MILT,
and IPCA achieve inaccurate tracking results. After that, IPCA completely fails to track the ball, while OAB1, OAB5, and MILT are
distracted by the other ball due to severe occlusion. In contrast, only ITDT can successfully track the ball continuously even in the case
of severe occlusion.

In the video sequence shown in Fig. 17, a girl rotates her body drastically (see footnote 16). At the end, her face is occluded by another person's
face. Suffering from severe occlusion, IPCA fails to track the face from the 442nd frame, while OAB5 begins to break down after the
486th frame. Due to the influence of the head's out-of-plane rotation, MILT, OAB1, OAB5, FragT, and L1T obtain inaccurate tracking
results from the 88th frame to the 265th frame. VTD can track the face persistently, but achieves inaccurate tracking results in most
frames. In contrast, the proposed ITDT achieves accurate tracking results throughout the video sequence.

16 Downloaded from http://vision.ucsd.edu/~bbabenko/project_miltrack.shtml

Fig. 13. The tracking results of the eight trackers over the representative frames (i.e., the 1st, 53rd, 70th, 72nd, 79th, and 83rd frames) of the "soccer"
video sequence in the scenarios with partial occlusions, head pose variations, background clutter, and motion blurring.

Fig. 14. The tracking results of the eight trackers over the representative frames (i.e., the 218th, 274th, and 314th frames) of the "video-car" video
sequence in the scenarios with small target and background clutter.

As shown in Fig. 18, a car is moving on a highway (see footnote 17). Due to the influence of both shadow disturbance and pose variation, OAB5 and
OAB1 completely fail to track the car after the 241st and 331st frames, respectively. VTD is able to track the car before
the 240th frame; however, it tracks the car inaccurately or unsuccessfully after the 240th frame. MILT begins to achieve inaccurate
tracking results after the 323rd frame. In contrast, ITDT can track the car accurately under shadow disturbance and pose
variation throughout the video sequence, while both IPCA and L1T achieve less accurate tracking results than ITDT.

D. Quantitative comparison 

1) Evaluation criteria: For all the twenty video sequences, the object center locations are labeled manually and used as the ground
truth. Hence, we can quantitatively evaluate the performances of the eight trackers by computing their pixel-based tracking location
errors from the ground truth.

In order to better evaluate the quantitative tracking performance of each tracker, we define a criterion called the tracking success
rate (TSR) as $\mathrm{TSR} = \frac{N_s}{N}$. Here $N$ is the total number of frames in a video sequence, and $N_s$ is the number of frames
in which a tracker successfully tracks the target. The larger the value of TSR, the better the performance of the tracker.
Furthermore, we introduce an evaluation criterion to determine the success or failure of tracking in each frame: $\frac{\mathrm{TLE}}{\max(W, H)}$, where
TLE is the pixel-based tracking location error with respect to the ground truth, $W$ is the width of the ground truth bounding box
for object localization, and $H$ is the height of the ground truth bounding box. If $\frac{\mathrm{TLE}}{\max(W, H)} < 0.25$, the tracker is considered to be
successful; otherwise, the tracker fails. For each tracker, we compute its corresponding TSRs for all the video sequences. These TSRs
are finally used as the criterion for the quantitative evaluation of each tracker.
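
The TSR criterion is straightforward to compute from per-frame center predictions. The sketch below assumes that the tracking location error is the Euclidean distance between the predicted and ground-truth centers (the text only says "pixel-based"); the function names and the three-frame example are hypothetical.

```python
import math

def is_success(pred_center, gt_center, gt_width, gt_height, threshold=0.25):
    """Per-frame test: TLE / max(W, H) < threshold."""
    tle = math.dist(pred_center, gt_center)  # assumed Euclidean center error
    return tle / max(gt_width, gt_height) < threshold

def tracking_success_rate(pred_centers, gt_centers, gt_sizes, threshold=0.25):
    """TSR = N_s / N over all frames of one video sequence."""
    n = len(gt_centers)
    n_s = sum(
        is_success(p, g, w, h, threshold)
        for p, g, (w, h) in zip(pred_centers, gt_centers, gt_sizes)
    )
    return n_s / n

# Three frames with a 20 x 40 ground-truth box: errors of 0, 2, and about 28.3
# pixels give ratios of 0, 0.05, and about 0.71, so two frames count as successes.
preds = [(10.0, 10.0), (12.0, 10.0), (30.0, 30.0)]
truth = [(10.0, 10.0), (10.0, 10.0), (10.0, 10.0)]
sizes = [(20, 40)] * 3
tsr = tracking_success_rate(preds, truth, sizes)
print(round(tsr, 4))
```

Dividing by the larger box side makes the threshold scale-invariant, so the same value of 0.25 can be applied to small and large targets.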

2) Investigation of nearest neighbor construction: The $K$ nearest neighbors used in our 3D-DCT representation are always ordered
according to their distances to the current sample (as described in Sec. IV-B). In order to examine the influence of this ordering,
we randomly exchange a few of the $K$ nearest neighbors and perform the tracking experiments again, as shown in Fig. 19. It is seen
from Fig. 19 that the tracking performances under the different orderings are close to each other.



In order to evaluate the effect of nearest neighbor selection, we conduct one experiment on three video sequences using different
choices of $K$ such that $K \in \{9, 11, 13, 15, 17, 19, 21\}$, as shown in Fig. 20. From Fig. 20, we can see that the tracking performances
using different configurations of $K$ within this range are close to each other. Therefore, our 3D-DCT representation is not very
sensitive to the choice of $K$ within this interval.

17 Downloaded from http://www.cs.toronto.edu/~dross/ivt/

Fig. 15. The tracking results of the three best trackers (i.e., ITDT, MILT, and L1T, for better visualization) over the representative frames (i.e., the
2nd, 46th, 442nd, 460th, 482nd, and 493rd frames) of the "pets-car" video sequence in the scenarios with partial occlusion and car pose variation.

Fig. 16. The tracking results of the eight trackers over the representative frames (i.e., the 1st, 8th, 9th, 11th, 13th, and 15th frames) of the "TwoBalls" video
sequence in the scenarios with severe occlusions and motion blurring.

3) Comparison of object representation and state inference: From Tab. I, we see that our tracker achieves equal or higher tracking
accuracies than the competing trackers in most cases. Moreover, our tracker utilizes the same state inference method (i.e., particle
filter) as IPCA, L1T, and VTD. Consequently, the performance gain over IPCA, L1T, and VTD is attributable to our 3D-DCT object
representation rather than to the state inference method.

Furthermore, we make a performance comparison between our particle filter-based method (referred to as "3D-DCT + Particle
Filter") and a simple state inference method (referred to as "3D-DCT + Sliding Window Search"). Fig. 21 shows that the
tracking performances of the two state inference methods are close to each other. Besides, Tab. I shows that our "3D-DCT + Particle
Filter" obtains more accurate tracking results than MILT and OAB, which use a sliding window for state inference.
Therefore, we conclude that the 3D-DCT object representation is mostly responsible for the enhanced tracking performance relative
to MILT and OAB.

Fig. 17. The tracking results of the three best trackers (i.e., ITDT, L1T, and VTD, for better visualization) over the representative frames (i.e., the 112th,
194th, 237th, 312th, 442nd, 460th, 464th, and 468th frames) of the "girl" video sequence in the scenarios with severe occlusion, in-plane/out-of-plane
rotation, and head pose variation.

Fig. 18. The tracking results of the eight trackers over the representative frames (i.e., the 237th, 304th, 313th, 324th, 485th, and 553rd frames) of
the "car4" video sequence in the scenarios with shadow disturbance and pose variation.

Fig. 19. Quantitative tracking performances using different cases of "temporal ordering" (obtained by small-scale random permutation) on the four
video sequences. The error curves of the four video sequences in this figure have the same y-axis scale as those of the four video sequences in Fig. 22.

4) Comparison of competing trackers: Fig. 22 plots the tracking location errors (highlighted in different colors) obtained by the
eight trackers for the first twelve video sequences. Furthermore, we also compute the mean and standard deviation of the tracking
location errors for the first twelve video sequences, and report the results in Fig. 23.

Fig. 20. Quantitative tracking performances using different choices of K on the three video sequences. The error curves of the three video sequences
in this figure have the same y-axis scale as those of the three video sequences in Fig. 22.

Fig. 21. Quantitative tracking performances of different state inference methods, i.e., sliding window search-based object tracking (referred to as
"3D-DCT + Sliding Window Search") and its comparison with particle filter-based tracking (referred to as "3D-DCT + Particle Filter") on the three
video sequences. The error curves of the three video sequences in this figure have the same y-axis scale as those of the three video sequences in
Fig. 22 and the supplementary file. Clearly, their tracking performances are almost consistent with each other.

Moreover, Tab. I reports the TSRs of the eight trackers over all twenty video sequences. From Tab. I, we
can see that the mean and standard deviation of the TSRs obtained by the proposed ITDT are 0.9802 and 0.0449, respectively, which
are the best among all eight trackers. The proposed ITDT also achieves the largest TSR on 19 out of 20 video sequences. On
the "surfer" video sequence, the proposed ITDT is slightly inferior to the best tracker, MILT (a TSR difference of 1.33%). We believe this is
because in the "surfer" video sequence, the tracked object (i.e., the surfer's head) has a low-resolution appearance with drastic motion
blurring. In addition, the surfer's body has a similar color appearance to the tracked object, which usually distracts
trackers that use color information. Furthermore, the tracked object's appearance varies greatly due to pose
variation and out-of-plane rotation. Under such circumstances, trackers using local features are usually more effective than those
using global features. Therefore, MILT, which uses Haar-like features, slightly outperforms the proposed ITDT, which uses color features, in
the "surfer" video sequence. In summary, the 3D-DCT based object representation used by the proposed ITDT is able to exploit
the correlation between the current appearance sample and the previous appearance samples in the 3D-DCT reconstruction process,
and encodes the discriminative information from object/non-object classes. This may have contributed to the tracking robustness in
complicated scenarios (e.g., partial occlusions and pose variations).
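
As a quick arithmetic check of the summary statistics quoted above, the ITDT column of Tab. I can be aggregated directly. The sketch assumes the reported s.t.d. is the sample standard deviation (n - 1 in the denominator), which is the variant that matches the reported figure; the variable names are illustrative.

```python
from statistics import mean, stdev

# ITDT tracking success rates from Tab. I, in the table's row order
# (trellis70 ... seq-simultaneous).
itdt_tsr = [
    1.0000, 0.9495, 0.9898, 0.9859, 1.0000, 0.9530, 1.0000, 1.0000, 1.0000, 1.0000,
    0.9741, 1.0000, 0.9973, 1.0000, 1.0000, 0.9761, 1.0000, 0.8020, 1.0000, 0.9756,
]

tsr_mean = round(mean(itdt_tsr), 4)      # 0.9802
tsr_std = round(stdev(itdt_tsr), 4)      # 0.0449 (sample standard deviation)
surfer_gap = round(0.9894 - 0.9761, 4)   # MILT minus ITDT on "surfer": 0.0133
print(tsr_mean, tsr_std, surfer_gap)
```

The gap of 0.0133 on "surfer" corresponds to the 1.33% difference mentioned in the text.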

Fig. 22. The tracking location error plots obtained by the eight trackers over the first twelve videos. In each sub-figure, the x-axis corresponds to the
frame index number, and the y-axis is associated with the tracking location error.

Fig. 23. The quantitative comparison results of the eight trackers over the first twelve videos. The figure reports the mean and standard deviation of
their tracking location errors over the first twelve videos. In each sub-figure, the x-axis shows the competing trackers, the y-axis is associated with the
means of their tracking location errors, and the error bars correspond to the standard deviations of their tracking location errors.

TABLE I

THE QUANTITATIVE COMPARISON RESULTS OF THE EIGHT TRACKERS OVER THE TWENTY VIDEO SEQUENCES. THE TABLE REPORTS THEIR
TRACKING SUCCESS RATES (I.E., TSRS) OVER EACH VIDEO SEQUENCE.

Sequence            FragT    VTD      MILT     OAB1     OAB5     IPCA     L1T      ITDT
trellis70           0.2974   0.4072   0.3493   0.2295   0.0339   0.3593   0.3972   1.0000
tiger               0.1672   0.5205   0.9495   0.2808   0.1767   0.1104   0.1451   0.9495
car11               0.4020   0.4326   0.1043   0.3181   0.2799   0.9211   0.5700   0.9898
animal              0.1408   0.0845   0.6761   0.3099   0.5352   0.1690   0.5352   0.9859
sub-three-persons   1.0000   0.4610   0.4481   0.4610   0.2662   0.4481   0.4481   1.0000
woman               0.2852   0.2004   0.2058   0.2148   0.1859   0.2148   0.2509   0.9530
soccer              0.1078   0.3824   0.2941   0.3725   0.4118   0.4902   0.9510   1.0000
video-car           0.4711   0.6353   0.1550   0.4225   0.0578   1.0000   0.9058   1.0000
pets-car            0.2959   0.4062   0.8801   0.1799   0.1199   0.4081   0.6983   1.0000
two-balls           0.1250   0.2500   0.3125   0.3125   0.3750   0.5625   0.1250   1.0000
girl                0.6335   0.9044   0.2211   0.1773   0.1633   0.8466   0.8845   0.9741
car4                0.4139   0.3783   0.4849   0.4547   0.2327   0.9982   1.0000   1.0000
shaking             0.1534   0.2767   0.9918   0.9890   0.8438   0.0110   0.0411   0.9973
pktest02            0.1667   1.0000   1.0000   1.0000   0.2333   1.0000   1.0000   1.0000
davidin300          0.4545   0.7900   0.9654   0.3550   0.4762   1.0000   0.8528   1.0000
surfer              0.2128   0.4149   0.9894   0.3112   0.0399   0.4069   0.2766   0.9761
singer2             0.9304   1.0000   1.0000   0.3783   0.2087   1.0000   0.6739   1.0000
seq-jd              0.8020   0.7723   0.5545   0.5446   0.3168   0.6634   0.2277   0.8020
cubicle             0.7255   0.9020   0.2353   0.4706   0.8627   0.7255   0.6863   1.0000
seq-simultaneous    0.6829   0.3171   0.2927   0.6829   0.6585   0.3171   0.5854   0.9756
mean                0.4234   0.5268   0.5555   0.4233   0.3239   0.5826   0.5629   0.9802
s.t.d.              0.2817   0.2768   0.3382   0.2315   0.2438   0.3360   0.3126   0.0449

VI. CONCLUSION

In this paper, we have proposed an effective tracking algorithm based on the 3D-DCT. In this algorithm, a compact object
representation is constructed using the 3D-DCT, which produces a compact energy spectrum whose high-frequency components are
discarded. The problem of constructing the compact object representation is thereby converted to that of efficiently compressing
and reconstructing the video data. To efficiently update the object representation during tracking, we have also proposed an incremental
3D-DCT algorithm, which decomposes the 3D-DCT into successive operations of the 2D-DCT and 1D-DCT on the video data.
The incremental 3D-DCT algorithm only needs to compute the 2D-DCT for newly added frames as well as the 1D-DCT along the time
dimension, leading to high computational efficiency. Moreover, by computing and storing the cosine basis functions beforehand, we
can significantly reduce the computational complexity of the 3D-DCT. Based on the incremental 3D-DCT algorithm, a discriminative
criterion has been designed to measure the information loss resulting from 3D-DCT based signal reconstruction, which is used to
evaluate the confidence score of a test sample belonging to the foreground object. Because it considers both the foreground and the
background reconstruction information, the discriminative criterion is robust to complicated appearance changes (e.g., out-of-plane
rotation and partial occlusion). Using this discriminative criterion, we have conducted visual tracking in the particle filtering framework,
which propagates sample distributions over time. Compared with several state-of-the-art trackers on challenging video sequences, the
proposed tracker is more robust to challenges including illumination changes, pose variations, partial occlusions, background
distractions, motion blurring, and complicated appearance changes. Experimental results have demonstrated the effectiveness and
robustness of the proposed tracker.